# hinge_loss

Hinge loss is a margin-based loss for classification. It’s the standard convex surrogate behind the (soft-margin) Support Vector Machine (SVM).

This notebook:

- defines binary and multiclass hinge loss with consistent notation
- builds intuition with Plotly plots
- implements the loss (and a useful subgradient) from scratch in NumPy
- uses hinge loss to optimize a simple linear classifier (primal SVM-style)

## Quick import

from sklearn.metrics import hinge_loss

Important: `hinge_loss` expects decision scores (real-valued margins), not probabilities.
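A minimal usage sketch (with made-up labels and scores): pass the true labels and the raw decision scores, and the function returns the mean hinge loss.

y_true = [-1, 1, 1, -1]                # labels (sklearn also accepts {0, 1})
pred_decision = [-2.2, 1.3, 0.4, 0.7]  # raw scores, e.g. from decision_function
print(hinge_loss(y_true, pred_decision))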

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from dataclasses import dataclass

from sklearn.datasets import make_blobs
from sklearn.metrics import hinge_loss as skl_hinge_loss
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)


## 1) Binary hinge loss (definition)

Binary classification with a **real-valued score**:

- label: $y \in \{-1, +1\}$
- model score: $s = f(x) \in \mathbb{R}$
- prediction: $\hat{y} = \mathrm{sign}(s)$

The key quantity is the **(signed) margin**:

$$
 m = y\,s.
$$

- If $m > 0$, the example is classified correctly.
- Larger $m$ means “more confident” (further from the decision boundary).

The **hinge loss** is:

$$
\ell(y, s) = \max(0, 1 - y s) = \max(0, 1 - m).
$$

Average hinge loss over a dataset:

$$
L = \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i s_i).
$$

### Relationship to 0–1 loss

The 0–1 loss is $\mathbb{1}[m \le 0]$ (wrong sign).

Hinge loss is a **convex upper bound**:

$$
\mathbb{1}[m \le 0] \;\le\; \max(0, 1 - m).
$$

So minimizing hinge loss tends to reduce classification errors while also encouraging a **margin** ($m \ge 1$ gives zero loss).
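To make the definition concrete, here is the hinge loss at a few illustrative margins; note that it upper-bounds the 0–1 loss at every $m$. The plot below also includes the squared hinge $\max(0, 1-m)^2$, a differentiable variant that penalizes margin violations quadratically.

for m_val in (-1.0, 0.0, 0.5, 1.0, 2.0):   # illustrative margins
    hinge = max(0.0, 1.0 - m_val)
    zero_one = 1.0 if m_val <= 0 else 0.0
    print(f"m = {m_val:+.1f} | 0-1 loss = {zero_one:.0f} | hinge = {hinge:.1f}")
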
m = np.linspace(-3, 3, 600)

loss_01 = (m <= 0).astype(float)
loss_hinge = np.maximum(0.0, 1.0 - m)
loss_sq_hinge = np.maximum(0.0, 1.0 - m) ** 2

fig = go.Figure()
fig.add_trace(go.Scatter(x=m, y=loss_01, name="0-1 loss  𝟙[m≤0]", line=dict(dash="dash")))
fig.add_trace(go.Scatter(x=m, y=loss_hinge, name="hinge  max(0, 1-m)", line=dict(width=3)))
fig.add_trace(go.Scatter(x=m, y=loss_sq_hinge, name="squared hinge (variant)", line=dict(dash="dot")))

fig.add_vline(x=0, line_dash="dot", line_color="gray")
fig.add_vline(x=1, line_dash="dot", line_color="gray")

fig.update_layout(
    title="Loss as a function of margin  m = y·score",
    xaxis_title="margin m",
    yaxis_title="loss",
    legend_title="",
)
fig.show()


## 2) Intuition: which points are penalized?

Because $\ell(m)=\max(0, 1-m)$:

- **Misclassified** points ($m \le 0$) get loss $\ge 1$.
- **Correct but too close** to the boundary ($0 < m < 1$) still get *some* loss.
- **Confident** points ($m \ge 1$) get **zero** loss.

This is why hinge-based models often end up depending heavily on a subset of points (those with $m \le 1$), commonly called **support vectors** in the SVM context.
m_samples = np.linspace(-2.5, 2.5, 60)
loss_samples = np.maximum(0.0, 1.0 - m_samples)

category = np.where(
    m_samples <= 0,
    "misclassified (m ≤ 0)",
    np.where(m_samples < 1, "correct but within margin (0 < m < 1)", "confident (m ≥ 1)"),
)

fig = px.scatter(
    x=m_samples,
    y=loss_samples,
    color=category,
    title="Only points with margin m < 1 contribute to hinge loss",
)
fig.add_vline(x=0, line_dash="dot", line_color="gray")
fig.add_vline(x=1, line_dash="dot", line_color="gray")
fig.update_layout(xaxis_title="margin m", yaxis_title="hinge loss")
fig.show()


## 3) Multiclass hinge loss (Crammer–Singer)

For $K$ classes, assume a score vector:

$$
 s(x) \in \mathbb{R}^K, \quad s_k(x) = \text{score for class } k.
$$

If the true class is $y \in \{0,\dots,K-1\}$, the multiclass hinge loss is:

$$
\ell(y, s) = \max\big(0, 1 + \max_{j \ne y} s_j - s_y\big).
$$

It enforces a **margin** between the true class score and the best competing score:

$$
 s_y \ge \max_{j \ne y} s_j + 1 \quad \Rightarrow \quad \ell = 0.
$$

This is the formulation used by `sklearn.metrics.hinge_loss` when `pred_decision` is shaped `(n_samples, n_classes)`.
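As a quick worked example (the same scores as the second row of the comparison cell below): with $K=3$, true class $y=1$, and scores $s=(0.1, 0.2, 0.0)$, the best competing score is $0.1$, so $\ell = \max(0,\ 1 + 0.1 - 0.2) = 0.9$. In code:

s_row = np.array([0.1, 0.2, 0.0])             # scores for one sample
y_row = 1                                      # true class index
best_other = np.max(np.delete(s_row, y_row))   # best competing score
print(max(0.0, 1.0 + best_other - s_row[y_row]))  # ≈ 0.9
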
def _as_1d_float(x: np.ndarray) -> np.ndarray:
    x = np.asarray(x, dtype=float)
    if x.ndim != 1:
        raise ValueError(f"Expected a 1D array, got shape={x.shape}")
    return x


def binary_hinge_loss(
    y_true: np.ndarray,
    scores: np.ndarray,
    *,
    margin: float = 1.0,
    sample_weight: np.ndarray | None = None,
    reduction: str = "mean",
) -> float:
    """Binary hinge loss: mean_i max(0, margin - y_i * score_i).

    Accepts labels in {0,1} or {-1,+1}. `scores` are raw decision scores.
    """

    y = _as_1d_float(y_true)
    s = _as_1d_float(scores)
    if y.shape[0] != s.shape[0]:
        raise ValueError(
            f"y_true and scores must match in length, got {y.shape[0]} vs {s.shape[0]}"
        )

    uniques = set(np.unique(y).tolist())
    if uniques.issubset({0.0, 1.0}):
        y = np.where(y == 0.0, -1.0, 1.0)
    elif not uniques.issubset({-1.0, 1.0}):
        raise ValueError(
            f"For binary hinge loss, y_true must be in {{0,1}} or {{-1,1}}, got {sorted(uniques)}"
        )

    loss = np.maximum(0.0, margin - y * s)

    if sample_weight is not None:
        w = _as_1d_float(sample_weight)
        if w.shape[0] != loss.shape[0]:
            raise ValueError("sample_weight must have the same length as y_true")
        if reduction == "mean":
            return float(np.sum(w * loss) / np.sum(w))
        if reduction == "sum":
            return float(np.sum(w * loss))
        raise ValueError("reduction must be 'mean' or 'sum'")

    if reduction == "mean":
        return float(np.mean(loss))
    if reduction == "sum":
        return float(np.sum(loss))
    raise ValueError("reduction must be 'mean' or 'sum'")


def multiclass_hinge_loss(
    y_true: np.ndarray,
    scores: np.ndarray,
    *,
    margin: float = 1.0,
    sample_weight: np.ndarray | None = None,
    reduction: str = "mean",
) -> float:
    """Multiclass hinge loss (Crammer–Singer): mean_i max(0, margin + max_{j!=y} s_ij - s_i,y).

    `y_true` are integer class labels in [0, K-1]. `scores` has shape (n, K).
    """

    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    if y.ndim != 1:
        raise ValueError(f"y_true must be 1D, got shape={y.shape}")
    if s.ndim != 2:
        raise ValueError(f"scores must be 2D, got shape={s.shape}")
    n, k = s.shape
    if y.shape[0] != n:
        raise ValueError("y_true and scores must match in n_samples")

    y = y.astype(int)
    if y.min() < 0 or y.max() >= k:
        raise ValueError(f"y_true values must be in [0, {k-1}]")

    true_scores = s[np.arange(n), y]

    s_other = s.copy()
    s_other[np.arange(n), y] = -np.inf
    max_other = np.max(s_other, axis=1)

    loss = np.maximum(0.0, margin + max_other - true_scores)

    if sample_weight is not None:
        w = _as_1d_float(sample_weight)
        if w.shape[0] != loss.shape[0]:
            raise ValueError("sample_weight must have the same length as y_true")
        if reduction == "mean":
            return float(np.sum(w * loss) / np.sum(w))
        if reduction == "sum":
            return float(np.sum(w * loss))
        raise ValueError("reduction must be 'mean' or 'sum'")

    if reduction == "mean":
        return float(np.mean(loss))
    if reduction == "sum":
        return float(np.sum(loss))
    raise ValueError("reduction must be 'mean' or 'sum'")
# --- Binary: compare against sklearn.metrics.hinge_loss ---

y_true_01 = np.array([0, 1, 0, 1])
score = np.array([-0.2, 0.5, 0.3, 1.2])

skl = skl_hinge_loss(y_true_01, score)
ours = binary_hinge_loss(y_true_01, score)
print("binary | sklearn:", skl)
print("binary | numpy :", ours)

# --- Multiclass: compare against sklearn.metrics.hinge_loss ---

y_true_mc = np.array([0, 1, 2])
scores_mc = np.array(
    [
        [2.0, 0.0, -1.0],
        [0.1, 0.2, 0.0],
        [-1.0, 0.0, 3.0],
    ]
)

skl_mc = skl_hinge_loss(y_true_mc, scores_mc)
ours_mc = multiclass_hinge_loss(y_true_mc, scores_mc)
print("multiclass | sklearn:", skl_mc)
print("multiclass | numpy :", ours_mc)
binary | sklearn: 0.65
binary | numpy : 0.65
multiclass | sklearn: 0.3
multiclass | numpy : 0.30000000000000004


## 4) Using hinge loss to optimize a linear classifier (soft-margin SVM style)

A common choice is a linear score function:

$$
 s_i = f(x_i) = w^T x_i + b.
$$

A soft-margin (primal) SVM objective is:

$$
J(w,b) = \frac{1}{2}\lVert w \rVert^2 + C\,\frac{1}{n}\sum_{i=1}^n \max\big(0, 1 - y_i(w^T x_i + b)\big).
$$

- The $\tfrac12\lVert w \rVert^2$ term is **L2 regularization** (prefers a wider margin).
- $C>0$ trades off margin size vs hinge penalties.

### Subgradient (what we need for gradient descent)

The hinge part is **not differentiable** at $m_i = 1$.
But it’s convex, so we can use a **subgradient**.

Let $m_i = y_i(w^T x_i + b)$ and define the “violators”:

$$
\mathcal{V} = \{i : m_i < 1\}.
$$

A convenient subgradient is:

$$
\nabla_w J = w - \frac{C}{n}\sum_{i\in\mathcal{V}} y_i x_i,
\qquad
\nabla_b J = - \frac{C}{n}\sum_{i\in\mathcal{V}} y_i.
$$

We’ll implement full-batch subgradient descent below.
@dataclass
class LinearSVMHistory:
    objective: list[float]
    mean_hinge: list[float]
    accuracy: list[float]


def linear_svm_objective(
    w: np.ndarray, b: float, X: np.ndarray, y: np.ndarray, *, C: float = 1.0
) -> tuple[float, float]:
    scores = X @ w + b
    hinge = np.maximum(0.0, 1.0 - y * scores)
    obj = 0.5 * float(w @ w) + C * float(np.mean(hinge))
    return obj, float(np.mean(hinge))


def linear_svm_subgrad(
    w: np.ndarray, b: float, X: np.ndarray, y: np.ndarray, *, C: float = 1.0
) -> tuple[np.ndarray, float]:
    n = X.shape[0]
    scores = X @ w + b
    margins = y * scores
    viol = margins < 1.0

    grad_w = w.copy()
    grad_b = 0.0

    if np.any(viol):
        grad_w -= (C / n) * (X[viol].T @ y[viol])
        grad_b = -(C / n) * float(np.sum(y[viol]))

    return grad_w, grad_b


def train_linear_svm_subgradient_descent(
    X: np.ndarray,
    y: np.ndarray,
    *,
    C: float = 1.0,
    lr: float = 0.2,
    n_epochs: int = 200,
    seed: int = 42,
) -> tuple[np.ndarray, float, LinearSVMHistory]:
    """Train a linear classifier with L2 + hinge using full-batch subgradient descent."""

    rng_local = np.random.default_rng(seed)
    w = rng_local.normal(scale=0.01, size=X.shape[1])
    b = 0.0

    hist = LinearSVMHistory(objective=[], mean_hinge=[], accuracy=[])

    for _ in range(n_epochs):
        obj, mean_hinge = linear_svm_objective(w, b, X, y, C=C)
        scores = X @ w + b
        y_pred = np.where(scores >= 0.0, 1.0, -1.0)
        acc = float(np.mean(y_pred == y))

        hist.objective.append(obj)
        hist.mean_hinge.append(mean_hinge)
        hist.accuracy.append(acc)

        grad_w, grad_b = linear_svm_subgrad(w, b, X, y, C=C)
        w = w - lr * grad_w
        b = b - lr * grad_b

    return w, b, hist


# --- Make a simple dataset ---
X_raw, y01 = make_blobs(n_samples=250, centers=2, cluster_std=1.8, random_state=42)
y_pm1 = np.where(y01 == 0, -1.0, 1.0)

scaler = StandardScaler()
X = scaler.fit_transform(X_raw)

w, b, hist = train_linear_svm_subgradient_descent(X, y_pm1, C=2.0, lr=0.15, n_epochs=220)

print("final objective:", hist.objective[-1])
print("final mean hinge:", hist.mean_hinge[-1])
print("final accuracy :", hist.accuracy[-1])
final objective: 0.5771675247777874
final mean hinge: 0.12760727393472507
final accuracy : 0.996
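As a sanity check on the subgradient formula, we can compare it against central finite differences of the objective at the learned $(w, b)$. This is a sketch that reuses `linear_svm_objective` and `linear_svm_subgrad` from above, and it assumes no training margin sits exactly at the kink $m_i = 1$, where the hinge is not differentiable.

def numerical_grad(w, b, X, y, *, C, eps=1e-6):
    """Central finite differences of the objective w.r.t. (w, b)."""
    grad_w_num = np.zeros_like(w)
    for j in range(w.shape[0]):
        e = np.zeros_like(w)
        e[j] = eps
        obj_plus, _ = linear_svm_objective(w + e, b, X, y, C=C)
        obj_minus, _ = linear_svm_objective(w - e, b, X, y, C=C)
        grad_w_num[j] = (obj_plus - obj_minus) / (2 * eps)
    obj_plus, _ = linear_svm_objective(w, b + eps, X, y, C=C)
    obj_minus, _ = linear_svm_objective(w, b - eps, X, y, C=C)
    grad_b_num = (obj_plus - obj_minus) / (2 * eps)
    return grad_w_num, grad_b_num


grad_w_num, grad_b_num = numerical_grad(w, b, X, y_pm1, C=2.0)
grad_w_ana, grad_b_ana = linear_svm_subgrad(w, b, X, y_pm1, C=2.0)
print("max |grad_w difference|:", float(np.max(np.abs(grad_w_num - grad_w_ana))))
print("    |grad_b difference|:", abs(grad_b_num - grad_b_ana))
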
from plotly.subplots import make_subplots

epochs = np.arange(len(hist.objective))

fig = make_subplots(
    rows=3,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.06,
    subplot_titles=(
        "Objective  (0.5||w||^2 + C·mean_hinge)",
        "Mean hinge loss",
        "Accuracy",
    ),
)

fig.add_trace(go.Scatter(x=epochs, y=hist.objective, name="objective"), row=1, col=1)
fig.add_trace(go.Scatter(x=epochs, y=hist.mean_hinge, name="mean hinge"), row=2, col=1)
fig.add_trace(go.Scatter(x=epochs, y=hist.accuracy, name="accuracy"), row=3, col=1)

fig.update_yaxes(title_text="value", row=1, col=1)
fig.update_yaxes(title_text="value", row=2, col=1)
fig.update_yaxes(title_text="", row=3, col=1, range=[0, 1.02])
fig.update_xaxes(title_text="epoch", row=3, col=1)

fig.update_layout(height=700, title="Training curves (full-batch subgradient descent)")
fig.show()
# Visualize decision boundary + margin band in 2D

x1_min, x1_max = float(X[:, 0].min() - 1.0), float(X[:, 0].max() + 1.0)
x2_min, x2_max = float(X[:, 1].min() - 1.0), float(X[:, 1].max() + 1.0)

xs = np.linspace(x1_min, x1_max, 200)

w0, w1 = float(w[0]), float(w[1])


def boundary_line(level: float) -> tuple[np.ndarray, np.ndarray]:
    """Return points (x1, x2) satisfying w0*x1 + w1*x2 + b = level."""
    if abs(w1) > 1e-10:
        x1 = xs
        x2 = (level - b - w0 * x1) / w1
        return x1, x2

    # Vertical line fallback
    x1 = np.full_like(xs, (level - b) / w0)
    x2 = np.linspace(x2_min, x2_max, xs.shape[0])
    return x1, x2


margins = y_pm1 * (X @ w + b)
support = margins <= 1.0 + 1e-12

fig = go.Figure()

# points by class
for cls, color in [(-1.0, "#1f77b4"), (1.0, "#d62728")]:
    mask = y_pm1 == cls
    fig.add_trace(
        go.Scatter(
            x=X[mask, 0],
            y=X[mask, 1],
            mode="markers",
            name=f"y={int(cls)}",
            marker=dict(size=8, color=color, line=dict(width=0)),
        )
    )

# highlight support vectors
fig.add_trace(
    go.Scatter(
        x=X[support, 0],
        y=X[support, 1],
        mode="markers",
        name="support (m ≤ 1)",
        marker=dict(size=14, color="rgba(0,0,0,0)", line=dict(width=2, color="black")),
    )
)

# decision boundary and margins
for level, name, dash, width, color in [
    (0.0, "decision f(x)=0", "solid", 3, "black"),
    (1.0, "+margin f(x)=+1", "dash", 2, "gray"),
    (-1.0, "-margin f(x)=-1", "dash", 2, "gray"),
]:
    x1, x2 = boundary_line(level)
    fig.add_trace(
        go.Scatter(
            x=x1,
            y=x2,
            mode="lines",
            name=name,
            line=dict(dash=dash, width=width, color=color),
        )
    )

fig.update_layout(
    title="Learned linear classifier with hinge loss (support vectors highlighted)",
    xaxis_title="x1 (scaled)",
    yaxis_title="x2 (scaled)",
)
fig.update_xaxes(range=[x1_min, x1_max])
fig.update_yaxes(range=[x2_min, x2_max])
fig.show()


## 5) The role of C (regularization trade-off)

In the objective

$$
\tfrac12\lVert w\rVert^2 + C\,\text{mean hinge},
$$

- **small `C`**: regularization dominates → wider margin, more tolerance for violations
- **large `C`**: hinge penalties dominate → tries harder to fit training points (narrower margin)

Below we train three models with different `C` values and compare the resulting decision boundaries.
from plotly.subplots import make_subplots

Cs = [0.2, 2.0, 20.0]
models: list[tuple[float, np.ndarray, float]] = []

for C in Cs:
    w_c, b_c, _ = train_linear_svm_subgradient_descent(
        X, y_pm1, C=C, lr=0.15, n_epochs=220, seed=42
    )
    models.append((C, w_c, b_c))

fig = make_subplots(rows=1, cols=len(Cs), subplot_titles=[f"C={C}" for C in Cs])

for col, (C, w_c, b_c) in enumerate(models, start=1):
    # data
    for cls, color in [(-1.0, "#1f77b4"), (1.0, "#d62728")]:
        mask = y_pm1 == cls
        fig.add_trace(
            go.Scatter(
                x=X[mask, 0],
                y=X[mask, 1],
                mode="markers",
                marker=dict(size=6, color=color),
                showlegend=(col == 1),
                name=f"y={int(cls)}",
            ),
            row=1,
            col=col,
        )

    # boundary (only f(x)=0 to keep it readable)
    w0, w1 = float(w_c[0]), float(w_c[1])
    if abs(w1) > 1e-10:
        x1 = xs
        x2 = (0.0 - b_c - w0 * x1) / w1
    else:
        x1 = np.full_like(xs, (0.0 - b_c) / w0)
        x2 = np.linspace(x2_min, x2_max, xs.shape[0])

    fig.add_trace(
        go.Scatter(
            x=x1,
            y=x2,
            mode="lines",
            line=dict(width=3, color="black"),
            showlegend=False,
            name="boundary",
        ),
        row=1,
        col=col,
    )

fig.update_layout(
    height=420,
    title="Effect of C on the learned decision boundary",
)
for col in range(1, len(Cs) + 1):
    fig.update_xaxes(title_text="x1", range=[x1_min, x1_max], row=1, col=col)
    fig.update_yaxes(title_text="x2", range=[x2_min, x2_max], row=1, col=col)

fig.show()
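To quantify the trade-off numerically, a short follow-up (reusing `models` from the cell above): the distance between the margin hyperplanes $f(x) = +1$ and $f(x) = -1$ is $2 / \lVert w \rVert$, so a smaller $\lVert w \rVert$ corresponds to a wider margin band.

for C, w_c, b_c in models:
    scores_c = X @ w_c + b_c
    margins_c = y_pm1 * scores_c
    y_pred_c = np.where(scores_c >= 0.0, 1.0, -1.0)
    print(
        f"C={C:5.1f}  ||w||={np.linalg.norm(w_c):.3f}  "
        f"margin width={2.0 / np.linalg.norm(w_c):.3f}  "
        f"points with m<=1: {int(np.sum(margins_c <= 1.0)):3d}  "
        f"train acc={float(np.mean(y_pred_c == y_pm1)):.3f}"
    )
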

## 6) Practical usage: `sklearn.metrics.hinge_loss`

`sklearn.metrics.hinge_loss(y_true, pred_decision, ...)` expects:

- binary: `pred_decision.shape == (n_samples,)` (a real-valued decision score)
- multiclass: `pred_decision.shape == (n_samples, n_classes)` (one score per class)

A common workflow:

1. train a classifier that exposes `decision_function`
2. compute `pred_decision = model.decision_function(X)`
3. evaluate with `hinge_loss(y_true, pred_decision)`

Below we fit `LinearSVC` and compare sklearn's hinge loss to our NumPy implementation.

clf = LinearSVC(C=2.0, dual=True, random_state=42)
clf.fit(X, y01)

dec = clf.decision_function(X)

skl = skl_hinge_loss(y01, dec)
ours = binary_hinge_loss(np.where(y01 == 0, -1.0, 1.0), dec)

print("sklearn hinge_loss:", skl)
print("numpy  hinge_loss:", ours)
sklearn hinge_loss: 0.012641047919911632
numpy  hinge_loss: 0.012641047919911632
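One detail to keep in mind: `LinearSVC` minimizes the squared hinge loss by default (`loss="squared_hinge"`), so the hinge loss reported above is an evaluation metric rather than the exact training objective. A small sketch refits with `loss="hinge"` for a closer match between objective and metric:

clf_hinge = LinearSVC(C=2.0, loss="hinge", dual=True, random_state=42)
clf_hinge.fit(X, y01)

print("hinge_loss (loss='hinge' model):", skl_hinge_loss(y01, clf_hinge.decision_function(X)))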


## 7) Pros, cons, and when to use hinge loss

### Pros

- **Convex** (for linear models): optimization is well-behaved (no local minima).
- **Margin-aware**: doesn’t just separate classes; encourages a safety buffer.
- **Sparse dependence on data** (SVM view): only points with $m \le 1$ influence the solution.
- Often strong performance for **high-dimensional** classification (e.g., text with bag-of-words / TF-IDF).

### Cons

- **Non-smooth** at $m=1$ (requires subgradients or a smoothed variant).
- Produces **uncalibrated scores** (unlike logistic loss, it’s not a log-likelihood).
- Not ideal when you need **probabilities** or well-calibrated uncertainty.
- Can be sensitive to **label noise** near the boundary (like most margin-based methods).

### Good use cases

- Binary or multiclass classification when you care about **large margins**.
- Linear classification on large, sparse feature spaces (classic SVM territory).
- As a surrogate for the 0–1 loss when you need a convex objective.

## 8) Common pitfalls and diagnostics

- **Use decision scores**: hinge loss needs raw scores (e.g., `decision_function`), not probabilities (see the sketch after this list).
- **Label encoding**: the math is cleanest with $y \in \{-1, +1\}$; many libraries accept {0, 1}, but be explicit.
- **Feature scaling**: for linear models with L2 regularization, scaling can strongly affect the margin and the effective regularization.
- **Class imbalance**: hinge loss itself doesn't fix imbalance; consider class weights or re-sampling.
- **Interpretation**: a lower hinge loss generally means larger margins, but it's not a calibrated probability of correctness.
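To make the first pitfall concrete, the sketch below feeds both raw decision scores and predicted probabilities into `hinge_loss`. The probability version runs without error but yields a misleading value, because probabilities in $[0, 1]$ can never express a confidently negative margin. `LogisticRegression` is used here only because it exposes both `decision_function` and `predict_proba`.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression().fit(X, y01)

scores_ok = logreg.decision_function(X)     # raw scores: what hinge_loss expects
probas_bad = logreg.predict_proba(X)[:, 1]  # probabilities in [0, 1]: misleading input

print("hinge_loss with decision scores:", skl_hinge_loss(y01, scores_ok))
print("hinge_loss with probabilities  :", skl_hinge_loss(y01, probas_bad))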

## Exercises

1. Implement squared hinge loss and compare optimization behavior (smoother gradients).
2. Add L1 regularization and see how it changes sparsity in `w`.
3. Compare hinge vs. logistic loss on the same dataset: decision boundary, calibration, and outliers.
4. Implement SGD (mini-batches) for the hinge objective and compare convergence.

## References

- scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hinge_loss.html
- Vapnik, V. (1995). *The Nature of Statistical Learning Theory*. Springer.
- Cortes, C., & Vapnik, V. (1995). Support-vector networks. *Machine Learning*, 20, 273–297.